Time-Optimal Top-k Document Retrieval

Authors

  • Gonzalo Navarro
  • Yakov Nekrich
Abstract

Let 𝒟 be a collection of D documents, which are strings over an alphabet of size σ, of total length n. We describe a data structure that uses linear space and reports the k most relevant documents that contain a query pattern P, which is a string of length p packed in p/log_σ n words, in time O(p/log_σ n + k). This is optimal in the RAM model in the general case where log D = Θ(log n), and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures, such as the number of times P appears in a document (called term frequency), a fixed document importance, and the minimal distance between two occurrences of P in a document. When log D = o(log n), we show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits, and to O(n(log σ + log D)) bits in the case of the popular term frequency measure of relevance, at the price of an additive term O(log^ε_σ n) in the query time, for any constant ε > 0. We also consider the dynamic scenario, where documents can be inserted and deleted from the collection. We obtain linear space and query time O(p(log log n)^2/log_σ n + log n + k log log k), whereas insertions and deletions require O(log^{1+ε} n) time per symbol, for any constant ε > 0. Finally, we consider an extended static scenario where an extra parameter par(P, d) is defined, and the query must retrieve only documents d such that par(P, d) ∈ [τ1, τ2], where this range is specified at query time. We solve these queries using linear space and O(p/log_σ n + log^{1+ε} n + k log^ε n) time, for any constant ε > 0. Our technique is to translate these top-k problems into multidimensional geometric search problems. As a bonus, we describe some improvements to those problems.
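
To make the query semantics concrete, the following is a minimal brute-force sketch of top-k retrieval under the term frequency measure. It is not the data structure described in the abstract: it scans every document on each query, so it takes time proportional to the total length n rather than O(p/log_σ n + k). The function names, the tie-breaking rule (smaller document index first), and the sample documents are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_id, tf) pairs, highest term frequency first.

    Naive baseline: every document is scanned on every query.
    Ties are broken by smaller doc_id (an illustrative assumption).
    """
    scored = []
    for doc_id, text in enumerate(docs):
        tf = count_occurrences(text, pattern)
        if tf > 0:  # only documents that contain the pattern qualify
            scored.append((tf, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(doc_id, tf) for tf, doc_id in best]


if __name__ == "__main__":
    sample = ["abracadabra", "banana", "cabana", "xyz"]
    print(top_k_by_term_frequency(sample, "a", 2))
    print(top_k_by_term_frequency(sample, "ana", 5))
```

Other relevance measures named in the abstract, such as a fixed document importance, could be plugged into the same baseline by replacing the scoring function.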

Similar resources

Practical Top-K Document Retrieval in Reduced Space

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k...

Space-Efficient Top-k Document Retrieval

Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k...

Top-K Color Queries for Document Retrieval

In this paper we describe a new efficient (in fact optimal) data structure for the top-K color problem. Each element of an array A is assigned a color c with priority p(c). For a query range [a, b] and a value K, we have to report K colors with the highest priorities among all colors that occur in A[a..b], sorted in reverse order by their priorities. We show that such queries can be answered in...

Top-k document retrieval in optimal space

We present an index for top-k most frequent document retrieval whose space is |CSA| + o(n) + D log(n/D) + O(D) bits, and its query time is O(log k log^{2+ε} n) per reported document, where D is the number of documents, n is the sum of lengths of the documents, and |CSA| is the space of the compressed suffix array for the documents. This improves over previous results for this problem, whose space complex...

Top-k document retrieval in optimal time and linear space

We describe a data structure that uses O(n)-word space and reports k most relevant documents that contain a query pattern P in optimal O(|P| + k) time. Our construction supports an ample set of important relevance measures, such as the frequency of P in a document and the minimal distance between two occurrences of P in a document. We show how to reduce the space of the data structure from O(n...

Ranked Document Selection

Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P, k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P, d). Linear space and optimal query time solutions already exist for t...


Journal:
  • SIAM J. Comput.

Volume 46

Publication date 2017